Combining multiple speech recognizers using voting and language model information
Abstract
In 1997, NIST introduced a voting scheme called ROVER for combining word scripts produced by different speech recognizers. This approach has achieved a relative word error reduction of up to 20% when used to combine the systems’ outputs from the 1998 and 1999 Broadcast News evaluations. Recently, there has been increasing interest in using this technique. This paper provides an analysis of several modifications of the original algorithm. Topics addressed are the order of combination, normalization/filtering of the systems’ outputs prior to combining them, treatment of ties during voting and the incorporation of language model information. The modified ROVER achieves an additional 5% relative word error reduction on the 1998 and 1999 Broadcast News evaluation test sets. Links with recent theoretical work on alternative error measures are also discussed.
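The voting step, with tie handling and a language model as tie-breaker, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the recognizer outputs have already been aligned into slots of equal length, whereas ROVER itself builds that alignment by dynamic programming. The names `vote_slot`, `combine`, `NULL`, and the per-word `lm_score` callable are hypothetical, and the abstract does not say how the language model is actually applied.

```python
from collections import Counter

NULL = ""  # hypothetical marker: this recognizer has no word in the slot


def vote_slot(candidates, lm_score=None):
    """Pick one word for a single aligned slot by majority vote.

    candidates: one word (or NULL) per recognizer, in system order.
    lm_score:   optional callable word -> score (higher is better),
                used only to break ties. This is an assumption, not
                the paper's stated method.
    """
    counts = Counter(candidates)
    best = max(counts.values())
    # Counter keeps insertion order, so ties are listed in system order.
    tied = [w for w in counts if counts[w] == best]
    if len(tied) == 1 or lm_score is None:
        # Without a language model, the earliest system wins a tie,
        # which is why the order of combination can matter.
        return tied[0]
    return max(tied, key=lm_score)


def combine(aligned, lm_score=None):
    """Combine pre-aligned hypotheses (equal-length word lists)."""
    words = [vote_slot(slot, lm_score) for slot in zip(*aligned)]
    return [w for w in words if w != NULL]


if __name__ == "__main__":
    hyps = [
        ["the", "cat", "sat", NULL],
        ["the", "cap", "sat", "down"],
        ["a", "cat", "sat", "down"],
    ]
    print(combine(hyps))
```

With only two systems every disagreement is a tie, so the result depends entirely on system order unless a score such as `lm_score` is supplied.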
Similar resources
Improved ROVER using Language Model Information
In the standard approach to speech recognition, the goal is to find the sentence hypothesis that maximizes the posterior probability of the word sequence given the acoustic observation. Usually speech recognizers are evaluated by measuring the word error so that there is a mismatch between the training and the evaluation criterion. Recently, algorithms for minimizing directly the word error and...
LV-ROVER: Lexicon Verified Recognizer Output Voting Error Reduction
Offline handwritten text line recognition is a hard task that requires both an efficient optical character recognizer and language model. Handwriting recognition state of the art methods are based on Long Short Term Memory (LSTM) recurrent neural networks (RNN) coupled with the use of linguistic knowledge. Most of the proposed approaches in the literature focus on improving one of the two compo...
Automatic Generation of Pronunciation Dictionaries
In this report we will describe a data driven approach for creating pronunciation dictionaries for a new unseen target language by voting among phoneme recognizers in nine different languages other than the target language. In this process recordings of the new language that are transcribed on word level are decoded by the phoneme recognizers. This results in a hypothesis of nine phonemes per t...
Combining forward-based and backward-based decoders for improved speech recognition performance
Combining outputs of speech recognizers is a known way of increasing speech recognition performance. The ROVER approach handles efficiently such combinations. In this paper we show that the best performance is not achieved by combining the outputs of the best set of recognizers, but rather by combining outputs of recognizers that rely on different processing components, and in particular on a d...
Spoken Term Detection Using Phoneme Transition Network from Multiple Speech Recognizers' Outputs
Spoken Term Detection (STD) that considers the out-of-vocabulary (OOV) problem has generated significant interest in the field of spoken document processing. This study describes STD with false detection control using phoneme transition networks (PTNs) derived from the outputs of multiple speech recognizers. PTNs are similar to subword-based confusion networks (CNs), which are originally derive...